This code explores the variables inside an Environmental World dataset using principle components analysis (PCA) and screenplot. The data was compiled from Google Earth Engine and put in a nice data set format for easy manipulation. In this code, variables relating to temperature, rain, elevation, land cover type, cloudiness, and windiness will be assessed to determine correlations between the variables.
Venter, Zander. Miscellaneous environmental and climatic variables. Variable inofrmation obtained from the Google Earth Engine and data is provided on Kaggle. https://www.kaggle.com/zanderventer/environmental-variables-for-world-countries
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
# attach packages
library(tidyverse)
library(here)
library(janitor)
library(ggfortify)
library(patchwork)
library(plotly)
library(beepr)
world_env_vars <- read_csv(here("data", "world_env_vars.csv")) %>% # read in data
filter(Country != "Antarctica") %>% # remove the country Antartica as it only contains NA values
clean_names() %>% # convert column headers to lower snake case
drop_na() # drop any remaining NA values
# PCA Analysis
env_vars_pca <- world_env_vars %>% # define PCA subset
select(elevation, cropland_cover:isothermality, rain_mean_annual, temp_mean_annual, wind, cloudiness) %>% # include these variables only
scale() %>%
prcomp()
# env_vars_pca$rotation
# env_vars_pca$sdev
# Create a principle component plot
x <- autoplot(env_vars_pca, # plot pca data set
data = world_env_vars, # use original data set
loadings = TRUE,
repel = TRUE,
colour = "country", # color points by country
loadings.label = TRUE,
loadings.colour = "black", # color lines as black
loadings.label.colour = "black", # color labels as black
hide_legend = TRUE)
ggplotly(x) # plot the pca graph with plotly
Figure 1. Principle Component Analysis. This biplot is comparing different variables to see how related they are to each other. Crop-land cover is negatively correlated with isothermality and cropland cover and mean annual rain are not correlated. Due to the small angle between mean annual rain and tree canopy cover, it can be concluded that these two variables are positively correlated. The short line length for elevation indicates that it is intersecting in a different dimension of the graph. Points indicate each observation and is colored by country.
sd_vec <- env_vars_pca$sdev
var_vec <- sd_vec ^ 2
pc_names <- colnames(env_vars_pca$rotation)
pct_expl_df <- data.frame(v = var_vec, # create the data and variables to be plotted in screenplot
pct_v = var_vec / sum(var_vec),
pc = fct_inorder(pc_names)) %>%
mutate(pct_lbl = paste0(round(pct_v * 100, 1), "%"))
ggplot(pct_expl_df,
aes(x = pc, y = v)) +
geom_col(aes(fill = pc)) + # fill color by pc variable made above
geom_text(aes(label = pct_lbl), vjust = 0, nudge_y = 0.005) +
scale_fill_manual(values = c("PC1" = "darkolivegreen", # manually recolor all of the PCA columns
"PC2" = "darkolivegreen4",
"PC3" = "olivedrab",
"PC4" = "olivedrab3",
"PC5" = "yellowgreen",
"PC6" = "darkolivegreen3",
"PC7" = "darkolivegreen2",
"PC8" = "olivedrab2")) +
labs(x = "Principle Component", # add x-axis label
y = "Variance Explained", # add y-axis label
fill = "Principle Component") + # label legend
theme_minimal() # change theme
Figure 2. Screenplot of the different principle component variables. The first 4 principle components explain 89.8% of the variance, showing that it is reasonable to drop the other four principle component variables.
From this analysis, we have discovered that: